This notebook demonstrates model workflows from the tidymodels R package, using {targets} as an explanatory tool.

Disclaimer: The fitting and modeling in this notebook do not represent best practices; they serve only to demonstrate workflows. In practice, you would tune each model using cross-validation on the training set. You would also define a different recipe for each model type in the workflow_set() function.
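To make the tuning advice concrete, here is a minimal sketch of cross-validated tuning on the training set. It is not part of this notebook's pipeline: it swaps in a single decision tree (rpart engine) because that engine ships with R, and the predictor subset, the 80/20 split, the five folds, and the grid size are all assumptions chosen for illustration.

```r
library(tidymodels)

# Assumption: the ames data from modeldata, with a log10-transformed outcome
data(ames, package = "modeldata")
set.seed(123)
ames_small <- ames %>%
  select(Sale_Price, Gr_Liv_Area, Year_Built) %>%
  mutate(Sale_Price = log10(Sale_Price))

# Hold out a test set, then resample the training set only
split <- initial_split(ames_small, prop = 0.8)
train <- training(split)
folds <- vfold_cv(train, v = 5)

# A decision tree stands in for the notebook's models (illustrative swap)
tree_spec <- decision_tree(cost_complexity = tune()) %>%
  set_engine("rpart") %>%
  set_mode("regression")

tune_wf <- workflow() %>%
  add_formula(Sale_Price ~ .) %>%
  add_model(tree_spec)

# Evaluate 5 candidate values of cost_complexity across the folds
tuned <- tune_grid(tune_wf, resamples = folds, grid = 5)

# Keep the candidate with the lowest cross-validated RMSE
best <- select_best(tuned, metric = "rmse")
final_wf <- finalize_workflow(tune_wf, best)
```

The finalized workflow can then be fit once on the full training set and evaluated on the held-out test set.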

At a high level, a workflow object combines some or all of these elements:

  1. Pre-processing (recipe)
  2. Model
  3. Post-processing (not always applicable)
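The first two elements can be sketched as follows. This is a minimal illustration rather than the recipe or model used in this notebook's pipeline: the three predictors, the log10 transform of the outcome, and the single dummy-encoding step are assumptions, and post-processing is left out.

```r
library(tidymodels)

# Assumption: the ames data from modeldata, with a log10-transformed outcome
data(ames, package = "modeldata")
ames_log <- ames %>% mutate(Sale_Price = log10(Sale_Price))

# 1. Pre-processing: a recipe that dummy-encodes the one categorical predictor
rec <- recipe(Sale_Price ~ Gr_Liv_Area + Year_Built + Bldg_Type, data = ames_log) %>%
  step_dummy(Bldg_Type)

# 2. Model: a parsnip specification for ordinary least squares
lm_spec <- linear_reg() %>%
  set_engine("lm")

# The workflow bundles both, so fit() and predict() apply the recipe automatically
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(lm_spec)

wf_fit <- fit(wf, data = ames_log)
preds <- predict(wf_fit, new_data = head(ames_log))
```

Bundling the recipe with the model means the same pre-processing is applied at prediction time without any extra code.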

We will fit the following models: lm_model, rf_model, and xgb_model.
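As a rough sketch, the three model objects could be parsnip specifications combined into a workflow set. The engines ("lm", "ranger", "xgboost"), the number of trees, and the formula used as the shared pre-processor are assumptions, not the settings used in this notebook.

```r
library(tidymodels)
library(workflowsets)

# Model specifications; engines and tree counts are assumptions
lm_model <- linear_reg() %>%
  set_engine("lm")

rf_model <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")

xgb_model <- boost_tree(trees = 500) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# Cross one pre-processor (here a plain formula) with the three models
wf_set <- workflow_set(
  preproc = list(base = Sale_Price ~ Gr_Liv_Area + Year_Built),
  models = list(lm = lm_model, rf = rf_model, xgb = xgb_model)
)
```

Each row of the resulting set holds one pre-processor/model combination. Fitting the rf and xgb workflows additionally requires the ranger and xgboost packages to be installed.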

Let’s take a look at the overall pipeline:

tar_glimpse()
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.2 ──
## ✓ broom     0.7.3      ✓ recipes   0.1.16
## ✓ dials     0.0.9      ✓ rsample   0.0.9 
## ✓ dplyr     1.0.2      ✓ tibble    3.1.1 
## ✓ ggplot2   3.3.2      ✓ tidyr     1.1.2 
## ✓ infer     0.5.3      ✓ tune      0.1.5 
## ✓ modeldata 0.1.0      ✓ workflows 0.2.2 
## ✓ parsnip   0.1.5      ✓ yardstick 0.0.7 
## ✓ purrr     0.3.4      
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x purrr::discard() masks scales::discard()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()
## x recipes::step()  masks stats::step()

These are the individual steps and how long each one takes to run, in seconds. Rows with NA are functions that {targets} tracks, not targets that get built:

tar_meta() %>%
  select(name, seconds) %>%
  kableExtra::kable()
name seconds
ames_raw 0.218
ames_cleaned 0.022
ames_split 0.030
ames_train 0.003
ames_recipe 0.020
workflow 0.045
fitted_models 4.222
report 22.899
make_workflow_sets NA
make_ames_recipe NA
fit_models NA
lm_model 0.001
ames_metrics 1.186
xgb_model 0.005
ames_test 0.002
rf_model 0.001
models 0.001
model_names 0.001
predicted 0.222
pred_actual 0.002
eval 0.027
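For orientation, a pipeline with these target names might be declared along the following lines. This is a hypothetical sketch, not the project's actual _targets.R: the target and function names come from the table above, but every command, the data source, the split proportion, and the argument lists of make_ames_recipe(), make_workflow_sets(), and fit_models() are guesses, and several targets are omitted.

```r
library(targets)

# Hypothetical target list; all commands and function signatures are assumptions
pipeline_sketch <- list(
  tar_target(ames_raw, modeldata::ames),
  tar_target(ames_split, rsample::initial_split(ames_raw, prop = 0.8)),
  tar_target(ames_train, rsample::training(ames_split)),
  tar_target(ames_test, rsample::testing(ames_split)),
  tar_target(ames_recipe, make_ames_recipe(ames_train)),
  tar_target(lm_model, parsnip::set_engine(parsnip::linear_reg(), "lm")),
  tar_target(models, list(lm_model = lm_model)),
  tar_target(workflow, make_workflow_sets(ames_recipe, models)),
  tar_target(fitted_models, fit_models(workflow, ames_train))
)
```

Because each command names the targets it depends on, {targets} can infer the dependency graph and rebuild only the steps whose inputs have changed.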

Descriptive Stats

Skimr

tar_read(ames_raw) %>%
  skim()
Data summary
Name Piped data
Number of rows 2930
Number of columns 74
_______________________
Column type frequency:
character 40
numeric 34
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
MS_SubClass 0 1 11 41 0 16 0
MS_Zoning 0 1 5 28 0 7 0
Street 0 1 4 4 0 2 0
Alley 0 1 5 15 0 3 0
Lot_Shape 0 1 7 20 0 4 0
Land_Contour 0 1 3 3 0 4 0
Utilities 0 1 6 6 0 3 0
Lot_Config 0 1 3 7 0 5 0
Land_Slope 0 1 3 3 0 3 0
Neighborhood 0 1 6 39 0 28 0
Condition_1 0 1 4 6 0 9 0
Condition_2 0 1 4 6 0 8 0
Bldg_Type 0 1 5 8 0 5 0
House_Style 0 1 4 16 0 8 0
Overall_Cond 0 1 4 13 0 9 0
Roof_Style 0 1 3 7 0 6 0
Roof_Matl 0 1 4 7 0 8 0
Exterior_1st 0 1 5 7 0 16 0
Exterior_2nd 0 1 5 7 0 17 0
Mas_Vnr_Type 0 1 4 7 0 5 0
Exter_Cond 0 1 4 9 0 5 0
Foundation 0 1 4 6 0 6 0
Bsmt_Cond 0 1 4 11 0 6 0
Bsmt_Exposure 0 1 2 11 0 5 0
BsmtFin_Type_1 0 1 3 11 0 7 0
BsmtFin_Type_2 0 1 3 11 0 7 0
Heating 0 1 4 5 0 6 0
Heating_QC 0 1 4 9 0 5 0
Central_Air 0 1 1 1 0 2 0
Electrical 0 1 3 7 0 6 0
Functional 0 1 3 4 0 8 0
Garage_Type 0 1 6 19 0 7 0
Garage_Finish 0 1 3 9 0 4 0
Garage_Cond 0 1 4 9 0 6 0
Paved_Drive 0 1 5 16 0 3 0
Pool_QC 0 1 4 9 0 5 0
Fence 0 1 8 17 0 5 0
Misc_Feature 0 1 4 4 0 6 0
Sale_Type 0 1 2 5 0 10 0
Sale_Condition 0 1 6 7 0 6 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Lot_Frontage 0 1 57.65 33.50 0.00 43.00 63.00 78.00 313.00 ▇▇▁▁▁
Lot_Area 0 1 10147.92 7880.02 1300.00 7440.25 9436.50 11555.25 215245.00 ▇▁▁▁▁
Year_Built 0 1 1971.36 30.25 1872.00 1954.00 1973.00 2001.00 2010.00 ▁▂▃▆▇
Year_Remod_Add 0 1 1984.27 20.86 1950.00 1965.00 1993.00 2004.00 2010.00 ▅▂▂▃▇
Mas_Vnr_Area 0 1 101.10 178.63 0.00 0.00 0.00 162.75 1600.00 ▇▁▁▁▁
BsmtFin_SF_1 0 1 4.18 2.23 0.00 3.00 3.00 7.00 7.00 ▃▂▇▁▇
BsmtFin_SF_2 0 1 49.71 169.14 0.00 0.00 0.00 0.00 1526.00 ▇▁▁▁▁
Bsmt_Unf_SF 0 1 559.07 439.54 0.00 219.00 465.50 801.75 2336.00 ▇▅▂▁▁
Total_Bsmt_SF 0 1 1051.26 440.97 0.00 793.00 990.00 1301.50 6110.00 ▇▃▁▁▁
First_Flr_SF 0 1 1159.56 391.89 334.00 876.25 1084.00 1384.00 5095.00 ▇▃▁▁▁
Second_Flr_SF 0 1 335.46 428.40 0.00 0.00 0.00 703.75 2065.00 ▇▃▂▁▁
Gr_Liv_Area 0 1 1499.69 505.51 334.00 1126.00 1442.00 1742.75 5642.00 ▇▇▁▁▁
Bsmt_Full_Bath 0 1 0.43 0.52 0.00 0.00 0.00 1.00 3.00 ▇▆▁▁▁
Bsmt_Half_Bath 0 1 0.06 0.25 0.00 0.00 0.00 0.00 2.00 ▇▁▁▁▁
Full_Bath 0 1 1.57 0.55 0.00 1.00 2.00 2.00 4.00 ▁▇▇▁▁
Half_Bath 0 1 0.38 0.50 0.00 0.00 0.00 1.00 2.00 ▇▁▅▁▁
Bedroom_AbvGr 0 1 2.85 0.83 0.00 2.00 3.00 3.00 8.00 ▁▇▂▁▁
Kitchen_AbvGr 0 1 1.04 0.21 0.00 1.00 1.00 1.00 3.00 ▁▇▁▁▁
TotRms_AbvGrd 0 1 6.44 1.57 2.00 5.00 6.00 7.00 15.00 ▁▇▂▁▁
Fireplaces 0 1 0.60 0.65 0.00 0.00 1.00 1.00 4.00 ▇▇▁▁▁
Garage_Cars 0 1 1.77 0.76 0.00 1.00 2.00 2.00 5.00 ▅▇▂▁▁
Garage_Area 0 1 472.66 215.19 0.00 320.00 480.00 576.00 1488.00 ▃▇▃▁▁
Wood_Deck_SF 0 1 93.75 126.36 0.00 0.00 0.00 168.00 1424.00 ▇▁▁▁▁
Open_Porch_SF 0 1 47.53 67.48 0.00 0.00 27.00 70.00 742.00 ▇▁▁▁▁
Enclosed_Porch 0 1 23.01 64.14 0.00 0.00 0.00 0.00 1012.00 ▇▁▁▁▁
Three_season_porch 0 1 2.59 25.14 0.00 0.00 0.00 0.00 508.00 ▇▁▁▁▁
Screen_Porch 0 1 16.00 56.09 0.00 0.00 0.00 0.00 576.00 ▇▁▁▁▁
Pool_Area 0 1 2.24 35.60 0.00 0.00 0.00 0.00 800.00 ▇▁▁▁▁
Misc_Val 0 1 50.64 566.34 0.00 0.00 0.00 0.00 17000.00 ▇▁▁▁▁
Mo_Sold 0 1 6.22 2.71 1.00 4.00 6.00 8.00 12.00 ▅▆▇▃▃
Year_Sold 0 1 2007.79 1.32 2006.00 2007.00 2008.00 2009.00 2010.00 ▇▇▇▇▃
Sale_Price 0 1 180796.06 79886.69 12789.00 129500.00 160000.00 213500.00 755000.00 ▇▇▁▁▁
Longitude 0 1 -93.64 0.03 -93.69 -93.66 -93.64 -93.62 -93.58 ▅▅▇▆▁
Latitude 0 1 42.03 0.02 41.99 42.02 42.03 42.05 42.06 ▂▂▇▇▇

Corrr

tar_read(ames_raw) %>%
  select_if(is.numeric) %>%
  correlate() %>%   # Create correlation data frame (cor_df)
  rearrange() %>%   # Rearrange by correlations
  shave() %>%       # Keep only one triangle of the matrix
  rplot()
## 
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
## Don't know how to automatically pick scale for object of type noquote. Defaulting to continuous.

GGpairs - Only Predictors + Outcome

Before cleaning:

tar_read(ames_raw) %>%
  select(Sale_Price, Gr_Liv_Area, Year_Built, Bldg_Type, Latitude, Longitude) %>%
  ggpairs()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

After cleaning:

tar_read(ames_cleaned) %>%
  select(Sale_Price, Gr_Liv_Area, Year_Built, Bldg_Type, Latitude, Longitude) %>%
  ggpairs()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Predictions

Let’s take a look at the predicted vs. observed values for our models.

tar_read(pred_actual) %>%
  pivot_longer(c(-Sale_Price)) %>%
  ggplot(aes(Sale_Price, value, col = name)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1) +
  scale_x_continuous(limits = c(4.5, NA)) +
  scale_y_continuous(limits = c(4.5, NA)) +
  facet_grid(name ~ .) +
  labs(title = "Predicted vs. Actual for Each Model", x = "Actual", y = "Predicted")
## Warning: Removed 3 rows containing missing values (geom_point).

Residuals vs. observed for each model:

tar_read(pred_actual) %>%
  pivot_longer(c(-Sale_Price)) %>%
  mutate(value = value - Sale_Price) %>%
  ggplot(aes(Sale_Price, value, col = name)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  facet_grid(name ~ .) +
  labs(title = "Actual vs. Residuals for Each Model", x = "Actual", y = "Residual")

Evaluation

tar_read(eval) %>%
  ggplot(aes(model, .estimate)) +
  geom_point() +
  facet_wrap(.metric ~ ., scales = "free") +
  coord_flip()
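A table with model, .metric, and .estimate columns like the one plotted here can be produced with yardstick. The sketch below uses a small made-up data frame in the same wide shape as pred_actual (one column of observed values plus one prediction column per model); the numbers are purely illustrative and eval_tbl is a hypothetical name.

```r
library(dplyr)
library(tidyr)
library(yardstick)

# Made-up values for illustration only, on a log10 price scale
pred_actual_toy <- tibble::tibble(
  Sale_Price = c(5.00, 5.20, 5.40, 5.10),
  lm_model   = c(5.10, 5.10, 5.30, 5.20),
  rf_model   = c(5.00, 5.25, 5.35, 5.10)
)

eval_tbl <- pred_actual_toy %>%
  # One row per observation and model
  pivot_longer(-Sale_Price, names_to = "model", values_to = "pred") %>%
  group_by(model) %>%
  # Default regression metrics: rmse, rsq, mae
  metrics(truth = Sale_Price, estimate = pred) %>%
  ungroup()
```

Grouping by model before calling metrics() returns one row per model and metric, which is the shape that facet_wrap() on .metric expects.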

Session Information

## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] GGally_2.1.1       corrr_0.4.3        skimr_2.1.3        workflowsets_0.0.2
##  [5] forcats_0.5.0      stringr_1.4.0      readr_1.4.0        tidyverse_1.3.0   
##  [9] yardstick_0.0.7    workflows_0.2.2    tune_0.1.5         tidyr_1.1.2       
## [13] tibble_3.1.1       rsample_0.0.9      recipes_0.1.16     purrr_0.3.4       
## [17] parsnip_0.1.5      modeldata_0.1.0    infer_0.5.3        ggplot2_3.3.2     
## [21] dplyr_1.0.2        dials_0.0.9        scales_1.1.1       broom_0.7.3       
## [25] tidymodels_0.1.2   tarchetypes_0.0.4  targets_0.1.0     
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_2.0-0   ellipsis_0.3.1     class_7.3-17       base64enc_0.1-3   
##  [5] fs_1.5.0           rstudioapi_0.13    farver_2.0.3       listenv_0.8.0     
##  [9] furrr_0.2.1        prodlim_2019.11.13 fansi_0.4.1        lubridate_1.7.9.2 
## [13] xml2_1.3.2         codetools_0.2-18   splines_4.0.3      knitr_1.30        
## [17] jsonlite_1.7.2     pROC_1.16.2        dbplyr_2.0.0       compiler_4.0.3    
## [21] httr_1.4.2         backports_1.2.1    assertthat_0.2.1   Matrix_1.2-18     
## [25] cli_2.2.0          visNetwork_2.0.9   htmltools_0.5.1.1  tools_4.0.3       
## [29] igraph_1.2.6       gtable_0.3.0       glue_1.4.2         Rcpp_1.0.5        
## [33] cellranger_1.1.0   DiceDesign_1.8-1   vctrs_0.3.6        iterators_1.0.13  
## [37] timeDate_3043.102  gower_0.2.2        xfun_0.20          globals_0.14.0    
## [41] ps_1.5.0           rvest_0.3.6        lifecycle_0.2.0    future_1.21.0     
## [45] MASS_7.3-53        TSP_1.1-10         ipred_0.9-9        hms_0.5.3         
## [49] parallel_4.0.3     RColorBrewer_1.1-2 yaml_2.2.1         rpart_4.1-15      
## [53] reshape_0.8.8      stringi_1.5.3      highr_0.8          foreach_1.5.1     
## [57] seriation_1.2-9    lhs_1.1.1          lava_1.6.8.1       repr_1.1.3        
## [61] rlang_0.4.10       pkgconfig_2.0.3    evaluate_0.14      lattice_0.20-41   
## [65] labeling_0.4.2     htmlwidgets_1.5.3  processx_3.4.5     tidyselect_1.1.0  
## [69] parallelly_1.22.0  plyr_1.8.6         magrittr_2.0.1     R6_2.5.0          
## [73] generics_0.1.0     DBI_1.1.0          pillar_1.6.0       haven_2.3.1       
## [77] withr_2.3.0        survival_3.2-7     nnet_7.3-14        modelr_0.1.8      
## [81] crayon_1.3.4       utf8_1.1.4         rmarkdown_2.6      grid_4.0.3        
## [85] readxl_1.3.1       data.table_1.13.4  callr_3.5.1        webshot_0.5.2     
## [89] reprex_0.3.0       digest_0.6.27      GPfit_1.0-8        munsell_0.5.0     
## [93] registry_0.5-1     viridisLite_0.3.0  kableExtra_1.3.1